This research study uses data on California's total annual net generation by energy source from 1990 to 2020 to estimate when California will be 100% powered by renewables and what the relative mix of renewables will be in that year. To do this, five algorithms (linear regression, polynomial regression, K-NN regression, K-NN classification, and K-Means clustering) were implemented and optimized, and their accuracy was evaluated. The K-NN regression model was the most accurate, but it could not extrapolate beyond 2020, so the optimized polynomial regression model (the second most accurate) was used to answer the research questions. That model predicts that California would be 100% powered by renewables just after 2027 (at the beginning of 2028), with solar and hydro being the most and second-most prominent energy sources, respectively. Moving forward, this study can be improved by (1) identifying more variables that affect the relative usage of different energy types in CA (e.g. political ideology, GDP, usable area of the state) and using them to generalize the model to other states and countries; (2) testing the models on data from other states and countries to identify whether they are over-fit or need more variables to be accurate; (3) adding data from 2021 and 2022 to improve the accuracy of the model; and (4) evaluating more algorithms (e.g. neural networks, logistic regression) in order to find a more accurate model.
In the 21st century, the world has become more conscious of global warming and of the role that human greenhouse gas emissions play in it. In fact, the Kyoto Protocol (signed 1997) and Paris Accords (signed 2015) aimed to set goals for countries to reduce their greenhouse gas emissions. So, how have these affected the types of energy that California uses (fossil fuels, renewables, etc.) between 1990 and 2020?
The goal of this study is to predict the relative percentages of the energy types that power California and to calculate when California will be 100% powered by renewables. The purpose of this research is to identify the changing trends in California's energy production in order to evaluate which energy sources should receive further investment and which should receive less.
Types of energy in this study: Coal, Geothermal, Hydroelectric Conventional, Natural Gas, Nuclear, Other, Other Biomass, Other Gases, Petroleum, Pumped Storage, Solar Thermal and Photovoltaic, Wind, Wood and Wood Derived Fuels.
Note: For this study, coal, natural gas, other gases, petroleum, wood and wood derived fuels, and other are all considered fossil fuels, while geothermal, pumped storage, hydro, nuclear, other biomass, solar, and wind are all considered renewable energy sources.
The dataset used is a list of the Annual Net Generation by State by Type of Producer by Energy Source for all 50 U.S. States and D.C. from the U.S. Energy Information Administration (a division of the U.S. Department of Energy), but only the total energy production information for CA from 1990 - 2020 is used for this study.
The first step of this study was to filter the dataset down to the total energy production for CA from 1990 to 2020 and to convert the raw generation values to relative percentages. After the original dataset was filtered in this way, exploratory data analysis was performed in order to better understand the data and identify prominent energy sources. The results of this exploratory data analysis can be seen below. Note that a second dataset was made that grouped the energy sources by whether they are renewable sources or fossil fuels. A visualization of this second dataset can also be seen below.
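The percentage conversion and the renewable/fossil grouping described above can be sketched as follows. This is a minimal illustration, not the study's actual code: the column names (`year`, `source`, `mwh`), the helper names, and the toy values are all hypothetical, and the real EIA spreadsheet would first need to be filtered to California's state-wide totals.

```python
import pandas as pd

# Sources counted as renewable in this study (taken from the list above).
RENEWABLE = {
    "Geothermal", "Pumped Storage", "Hydroelectric Conventional", "Nuclear",
    "Other Biomass", "Solar Thermal and Photovoltaic", "Wind",
}

def to_relative_percent(long_df):
    """Pivot (year, source, mwh) rows into a year-by-source table of percentages."""
    wide = long_df.pivot_table(
        index="year", columns="source", values="mwh", aggfunc="sum"
    ).fillna(0.0)
    # Divide each row by that year's total generation.
    return wide.div(wide.sum(axis=1), axis=0) * 100

def group_renewable_vs_fossil(pct):
    """Collapse the per-source percentages into two columns."""
    renew_cols = [c for c in pct.columns if c in RENEWABLE]
    return pd.DataFrame({
        "Renewable": pct[renew_cols].sum(axis=1),
        "Fossil": pct.drop(columns=renew_cols).sum(axis=1),
    })

# Toy input with made-up generation values (NOT EIA figures).
toy = pd.DataFrame({
    "year": [1990] * 3 + [1991] * 3,
    "source": ["Natural Gas", "Solar Thermal and Photovoltaic", "Wind"] * 2,
    "mwh": [60.0, 10.0, 30.0, 50.0, 20.0, 30.0],
})

pct = to_relative_percent(toy)
grouped = group_renewable_vs_fossil(pct)
print(grouped)
```

Every source not listed in `RENEWABLE` falls into the fossil column, which mirrors the study's two-way split.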
From 1990 to 2020, natural gas was the largest overall source of California's energy (mean = 48.99%), and its usage even increased by 6.56% over that period. California's heavy use of natural gas drives its high usage of fossil fuels: 1996-1998, the years in which renewable energy made up a higher proportion of California's energy than fossil fuels, were all years in which natural gas was well below its mean proportion. Thus, natural gas is a large factor in California's usage of fossil fuels. The same relationship holds between solar and renewables. Between 1990 and 2020, solar had a 6989% increase in usage, most of which occurred after 2012. Over the same period as solar's rapid growth, renewable energy had a large increase in its relative proportion of California's energy use.
After the exploratory data analysis, 5 different algorithms were used to answer the research questions. These algorithms were linear regression, polynomial regression, K-NN regression, K-NN classification, and K-Means clustering.
A linear regression model was used as a first step toward predicting the relative usage of fossil fuels and renewable energy. Linear regression was expected to be a poor fit for the data, but it served as a baseline before moving on to more complex models. The linear regression model was trained with a train/test split (test_size = 0.2), and its accuracy was evaluated in terms of Mean Absolute Error (MAE), Mean Squared Error (MSE), and R2 score. The model was used to predict the relative percentages for the fossil fuel and renewable energy sections, as part of the process of answering the first research question.
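A baseline of this kind might look like the sketch below, which uses scikit-learn. The renewable-share series is synthetic and only stands in for the real data, and the `random_state` values are assumptions added so the split is reproducible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
years = np.arange(1990, 2021)
# Synthetic, curved renewable share in percent (NOT the EIA data).
renewable_pct = 45 + 0.02 * (years - 2005) ** 2 + rng.normal(0, 2, years.size)

X = years.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix
X_train, X_test, y_train, y_test = train_test_split(
    X, renewable_pct, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

metrics = {
    "MAE": mean_absolute_error(y_test, pred),
    "MSE": mean_squared_error(y_test, pred),
    "R2": r2_score(y_test, pred),
}
print(metrics)
```

With roughly 30 yearly samples, a 20% hold-out leaves only a handful of test points, so the three metrics can shift noticeably from one random split to another.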
A polynomial regression model was used because fossil fuel and renewable energy were found not to have a linear relationship with the year. Answering the research questions (when California will be 100% powered by renewables, and the relative usage of different energy sources in that year) therefore required a regression model that fits the data better than a straight line, so polynomial regression was chosen. The polynomial regression model was trained on the entire dataset, with the first step being to optimize the degree of the polynomial. Polynomial models of degree 1 through 10 were trained, and the optimal degree was selected by minimizing the MAE and MSE of the model while maximizing its R2 score. A new polynomial model was then trained with the optimal degree (3) and used to predict the relative percentages of the fossil fuel and renewable energy sections, and its accuracy was evaluated via the MAE, MSE, and R2 metrics.
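The degree sweep can be sketched with NumPy alone. The data below is synthetic, the helper name `poly_metrics` is hypothetical, and rescaling the years to [-1, 1] is an added assumption that keeps the high-degree fits numerically stable.

```python
import numpy as np

def poly_metrics(x, y, degree):
    """Fit a polynomial of the given degree on all points and score it in-sample."""
    # Rescale years to [-1, 1] so that high powers stay well conditioned.
    xs = 2 * (x - x.mean()) / (x.max() - x.min())
    coeffs = np.polyfit(xs, y, degree)
    resid = y - np.polyval(coeffs, xs)
    return {
        "MAE": float(np.mean(np.abs(resid))),
        "MSE": float(np.mean(resid ** 2)),
        "R2": float(1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)),
    }

rng = np.random.default_rng(0)
years = np.arange(1990, 2021).astype(float)
# Synthetic cubic trend plus noise (NOT the EIA data).
renewable_pct = 45 + 0.004 * (years - 2005) ** 3 + rng.normal(0, 2, years.size)

metrics = {d: poly_metrics(years, renewable_pct, d) for d in range(1, 11)}
for d, m in metrics.items():
    print(d, round(m["MAE"], 2), round(m["MSE"], 2), round(m["R2"], 3))
```

When the same points are used for fitting and scoring, the MSE can only fall or stay flat as the degree rises, so this sketch only tabulates the metrics. Picking a degree from the table, for example where the improvement levels off or by scoring on held-out years, remains a judgment call.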
Since the y values of the dataset are continuous, K-NN regression was chosen in an attempt to improve accuracy over the linear and polynomial regression models. Because K-NN regression bases each prediction on the data points closest to the queried value rather than on a fitted function, it was expected to be more accurate than polynomial regression. The K-NN regression model was trained using a train/test split (test_size = 0.2). Multiple K-NN regression models with different values of K (1-10) were trained in order to determine the optimal value of K (2). A new K-NN regression model with the optimal value of K (2) was then trained and tested for the fossil fuel and renewable energy sections. The accuracy of this K-NN regression model was evaluated using the MAE, MSE, and R2 score metrics.
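A sweep over K of this kind could be written as below with scikit-learn's `KNeighborsRegressor`. The series is synthetic, and selecting K by test-set MSE alone is a simplifying assumption, since the study also looked at MAE and R2.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
years = np.arange(1990, 2021)
# Synthetic, wavy renewable share in percent (NOT the EIA data).
renewable_pct = 45 + 8 * np.sin((years - 1990) / 5) + rng.normal(0, 2, years.size)

X = years.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, renewable_pct, test_size=0.2, random_state=0
)

mse_by_k = {}
for k in range(1, 11):
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    mse_by_k[k] = mean_squared_error(y_test, model.predict(X_test))

# Keep the K with the lowest held-out error.
best_k = min(mse_by_k, key=mse_by_k.get)
print(best_k, mse_by_k[best_k])
```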
A K-NN classification model was used to identify whether the different energy sources have intrinsic properties that a model could pick up on. For this model, the dataset with all the energy sources was used in order to obtain a holistic view of the data. Because many variables affect the usage of a given energy source, it was thought that using just a relative percentage would prevent over-fitting and allow all of those variables to be implicitly included. First, the data was organized into columns of year, percent, and energy source. The data was normalized before a train/test split (test_size = 0.2) was applied, and energy source was converted to a numeric label so that the model could use it. Multiple K-NN classification models were trained with different values of K (1-49) in order to find the value of K with the highest Jaccard score. A new K-NN classification model with this optimal K was then trained and tested using K-fold cross validation (cv = 10) to obtain a more reliable estimate of its accuracy, which was calculated as the mean Jaccard index across all the folds.
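The classification setup can be sketched as follows. The rows are synthetic, and two choices are assumptions rather than the study's method: the scaler sits inside a pipeline so it is refit on each training fold, and the multi-class Jaccard index is macro-averaged (`scoring="jaccard_macro"`), since the averaging scheme is not stated.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

rng = np.random.default_rng(3)
n_years = 31
# Synthetic (year, percent, source) rows for three sources (NOT the EIA data).
sources = np.repeat(["Natural Gas", "Hydroelectric Conventional", "Wind"], n_years)
years = np.tile(np.arange(1990, 2021), 3)
percent = np.repeat([49.0, 15.0, 5.0], n_years) + rng.normal(0, 6.0, sources.size)

X = np.column_stack([years, percent])
y = LabelEncoder().fit_transform(sources)  # energy source -> integer label

mean_jaccard = {}
for k in range(1, 50):
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(pipe, X, y, cv=10, scoring="jaccard_macro")
    mean_jaccard[k] = scores.mean()

best_k = max(mean_jaccard, key=mean_jaccard.get)
print(best_k, round(mean_jaccard[best_k], 3))
```

The synthetic sources here are deliberately well separated by share, so the score will be higher than what the study reports for the real data.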
A K-Means clustering model was used to determine whether a model would be able to group the energy sources together without any labeling (an extension of the intrinsic properties of the energy sources investigated with the K-NN classification model). For the K-Means clustering model, the dataset with all the energy sources was used in order to obtain a holistic view of the data. The data was first normalized, and then the elbow method was used to choose the number of clusters. A K-Means clustering model was then built with this number of clusters (4), and its accuracy was evaluated using the results of the elbow method.
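The elbow method is usually run by fitting K-Means for a range of cluster counts and recording the inertia (within-cluster sum of squared distances) for each. The sketch below does this with scikit-learn on synthetic (year, percent) rows; the four share levels are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n_years = 31
# Synthetic rows for four hypothetical sources at different share levels.
years = np.tile(np.arange(1990, 2021), 4)
percent = np.repeat([2.0, 10.0, 20.0, 50.0], n_years) + rng.normal(0, 1.0, 4 * n_years)

# Normalize both columns so neither dominates the distance calculation.
X = StandardScaler().fit_transform(np.column_stack([years, percent]))

inertia_by_k = {}
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertia_by_k[k] = km.inertia_

for k, inertia in inertia_by_k.items():
    print(k, round(inertia, 1))
```

The "elbow" is the value of K after which the inertia stops dropping sharply. It is read off the printed values or a plot rather than computed by a single rule.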
After all the algorithms were implemented and their accuracies evaluated, the K-NN regression model (which had the highest accuracy) was initially chosen to answer the research question. However, the K-NN regression model was unable to predict future values beyond 2020, so the polynomial regression model (the second most accurate model) was chosen instead.
A polynomial regression model with the optimal number of degrees (3) as determined previously was trained on the entire 1990 - 2020 dataset and then used to predict values from 2021 - 2030. The polynomial regression model was used for both the fossil fuel vs renewable energy dataset and the dataset with all the types of energy sources in CA.
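The extrapolation step can be sketched as below: fit a degree-3 polynomial on 1990-2020, predict 2021-2030, and look for the first year at or above 100%. The input series is a noise-free synthetic cubic, so the year it prints says nothing about California. Shifting the years by 2005 and capping the output at 100% are added assumptions.

```python
import numpy as np

years = np.arange(1990, 2021)
t = years - 2005  # shift years so the cubic fit is well conditioned
# Synthetic renewable share in percent (NOT the EIA data).
renewable_pct = 45 + 0.004 * t ** 3

coeffs = np.polyfit(t, renewable_pct, 3)

future_years = np.arange(2021, 2031)
raw_pred = np.polyval(coeffs, future_years - 2005)

# A share cannot exceed 100%, so cap the reported values.
capped_pred = np.clip(raw_pred, 0, 100)

# First forecast year whose uncapped prediction reaches 100%, if any.
crossed = future_years[raw_pred >= 100]
first_full_year = int(crossed[0]) if crossed.size else None
print(first_full_year)
```

Checking the uncapped prediction against 100 gives the first whole year at or past the threshold. A fractional crossing point such as "the beginning of 2028" would need the polynomial to be solved for the exact root instead.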
The results of the Linear Regression model are shown below.
Both fossil fuels and renewable energy have a low R2 score, suggesting that they cannot be accurately predicted by linear regression. This can be confirmed by finding the correlation coefficient between year and fossil fuels and between year and renewable energy. A correlation coefficient with a magnitude greater than 0.7 indicates a linear tendency. The results of a correlation coefficient test can be seen below.
Both fossil fuels and renewable energy have a correlation with year of magnitude 0.156005 (lower than 0.7). So, neither fossil fuels nor renewable energy can be accurately predicted by linear regression.
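A correlation check of this kind takes a few lines of NumPy. The sketch below uses a synthetic curved series and assumes the fossil and renewable shares sum to 100% in every year. Under that assumption the two correlations with year are exact negatives of each other, which would explain why both groups show the same magnitude.

```python
import numpy as np

years = np.arange(1990, 2021)
# Synthetic U-shaped renewable share in percent (NOT the EIA data).
renewable_pct = 40 + 0.05 * (years - 2004) ** 2
fossil_pct = 100 - renewable_pct  # assumption: the two shares sum to 100%

# Pearson correlation of each share with the year.
r_renewable = np.corrcoef(years, renewable_pct)[0, 1]
r_fossil = np.corrcoef(years, fossil_pct)[0, 1]

print(round(r_renewable, 3), round(r_fossil, 3))
```

The renewable share here is a deterministic function of the year, yet the correlation stays well below 0.7. A low coefficient therefore rules out a linear trend, not a relationship altogether.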
The results of the Polynomial Regression model are shown below.
Although the polynomial regression is more accurate than the linear regression (especially after the degree of the polynomial was optimized), it still has an MSE of 21.77 and an R2 score of 0.41, which is not very accurate. This is likely due in part to the small sample size (30), which limits the accuracy of the model.
The results of the K-NN Regression model are shown below.
K-NN regression did have a higher accuracy than linear and polynomial regression, with the K-NN regression model having both a lower MSE and a higher R2 score than the linear and polynomial regression models. However, even after optimizing the K-value, the K-NN regression model still had an MSE of ~13.4 and an R2 score of ~0.64. This is likely because the dataset was very small (30 samples) and had a large variation, so even K-NN regression could not be highly accurate.
The results of the K-NN Classification model are shown below.
The assumption that the different types of energy sources had intrinsic differentiators that a model could pick up on was incorrect. Even though the K-value was optimized and K-fold cross validation was performed, the model still had a mean Jaccard index of ~0.35, which is low. This result suggests that year and relative percentage alone give an incomplete view of the factors affecting the relative usage of energy sources in CA, and that further research is needed to identify the missing factors before a model can be accurate.
The results of the K-Means Clustering model are shown below.
Based on the results with K-NN classification, the K-Means clustering model was not expected to be accurate, since it had already become clear that certain relevant variables were not being considered. The elbow method was used to find the best value of K for the K-Means clustering model, but even this value of K had a high error. The results showed that the model grouped mainly by percentage, which was expected. After building both the K-NN classification and K-Means clustering models, it was clear that more variables need to be added to the dataset in order to improve the accuracy of the models.
The results of the final steps (answering the research question) are shown below. K-NN regression with k = 2 had the highest accuracy out of all the algorithms tested, so it will be used to predict when California will be 100% powered by renewables.
The prediction of the K-NN Regression model with K = 2 is shown below.
K-NN regression has the highest training accuracy but it is unable to predict future values beyond 2020. So, even though polynomial regression has a lower accuracy than K-NN regression, it will be used to answer the research questions.
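The reason K-NN regression cannot extrapolate is visible in a small experiment. For any year after the last training year, the two nearest neighbours are always the last two training years, so every future prediction is the same average. The sketch below uses scikit-learn with a synthetic rising series.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

years = np.arange(1990, 2021)
# Synthetic, steadily rising renewable share in percent (NOT the EIA data).
renewable_pct = 40 + 0.02 * (years - 1990) ** 2

model = KNeighborsRegressor(n_neighbors=2).fit(years.reshape(-1, 1), renewable_pct)

future_years = np.arange(2021, 2031).reshape(-1, 1)
future_pred = model.predict(future_years)

# With K = 2 every future year maps to the mean of the 2019 and 2020 values.
plateau = renewable_pct[-2:].mean()
print(future_pred)
print(plateau)
```

However steep the historical trend is, the forecast is a flat line, so the model can never reach a level such as 100% that lies above the values it was trained on.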
The results of the polynomial regression model are shown below.
Polynomial regression is limited in that it does not know that renewable energy usage cannot go beyond 100%, so all predicted values over 100% must be disregarded. However, this polynomial regression does suggest that between 2027 and 2028 (at the beginning of 2028), California will be 100% powered by renewables. Because the regression uses a polynomial of degree 3, its predicted values for renewables grow at an accelerating (cubic) rate outside the training range, which may overstate real growth. Thus, California is more likely to reach 100% renewables somewhat later than 2027-2028, though the large percentage increase of solar (6989% from 1990 to 2020, most of it since 2012) may counteract this.
The polynomial regression model suggests that solar and hydro will be the most prominent renewables in 2027 (with solar being the most prominent), when CA is predicted to be 100% powered by renewables. The solar prediction aligns with the rapid increase that solar has had so far (6989% from 1990 to 2020, most of it since 2012). The hydro prediction is more surprising: from 1990 to 2020, hydro had a ~23% decrease in relative usage, while wind had a 322% increase in usage. This may be due to the model identifying an implicit variable related to hydro usage that is not immediately apparent.
Four steps for improving accuracy and moving forward are listed below.
1. Identify more variables that affect the relative usage of different energy types in CA (e.g. political ideology, GDP, usable area of the state) and add them to the dataset and models.
   a. This would generalize the model more and allow it to be applied to different states and countries.
2. Test these models with data from different states and countries in order to identify whether the models are over-fit or need more variables to be accurate.
3. Add data from 2021 and 2022 to improve the accuracy of the model.
4. Use more algorithms (e.g. neural networks, logistic regression) in order to improve the accuracy of the chosen model by evaluating more model options.
EIA. (2021, November 3). Net Generation by State by Type of Producer by Energy Source (EIA-906, EIA-920, and EIA-923). U.S. Energy Information Administration - Independent Statistics & Analysis. Retrieved July 20, 2022, from https://www.eia.gov/electricity/data/state/annual_generation_state.xls